14 research outputs found

    Discovery of low-dimensional structure in high-dimensional inference problems

    Full text link
    Many learning and inference problems involve high-dimensional data such as images, video or genomic data, which cannot be processed efficiently using conventional methods due to their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure, for instance they can often be represented sparsely in some basis or domain. The discovery of an underlying low-dimensional structure is important to develop more robust and efficient analysis and processing algorithms. The first part of the dissertation investigates the statistical complexity of sparse recovery problems, including sparse linear and nonlinear regression models, feature selection and graph estimation. We present a framework that unifies sparse recovery problems and construct an analogy to channel coding in classical information theory. We perform an information-theoretic analysis to derive bounds on the number of samples required to reliably recover sparsity patterns independent of any specific recovery algorithm. In particular, we show that sample complexity can be tightly characterized using a mutual information formula similar to channel coding results. Next, we derive major extensions to this framework, including dependent input variables and a lower bound for sequential adaptive recovery schemes, which helps determine whether adaptivity provides performance gains. We compute statistical complexity bounds for various sparse recovery problems, showing our analysis improves upon the existing bounds and leads to intuitive results for new applications. In the second part, we investigate methods for improving the computational complexity of subgraph detection in graph-structured data, where we aim to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications such as detection of network intrusions, community detection, detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel nearly-linear time algorithm to solve the relaxed problem, establish convergence and consistency guarantees and demonstrate its feasibility and performance with experiments on real networks

    Clustering and Community Detection with Imbalanced Clusters

    Full text link
    Spectral clustering methods which are frequently used in clustering and community detection applications are sensitive to the specific graph constructions particularly when imbalanced clusters are present. We show that ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to imbalanced cluster sizes since they tend to emphasize cut sizes over cut values. We propose a graph partitioning problem that seeks minimum cut partitions under minimum size constraints on partitions to deal with imbalanced cluster sizes. Our approach parameterizes a family of graphs by adaptively modulating node degrees on a fixed node set, yielding a set of parameter dependent cuts reflecting varying levels of imbalance. The solution to our problem is then obtained by optimizing over these parameters. We present rigorous limit cut analysis results to justify our approach and demonstrate the superiority of our method through experiments on synthetic and real datasets for data clustering, semi-supervised learning and community detection.Comment: Extended version of arXiv:1309.2303 with new applications. Accepted to IEEE TSIP

    Connected subgraph detection with mirror descent on SDPs

    Full text link
    We propose a novel, computationally efficient mirror-descent based optimization framework for subgraph detection in graph-structured data. Our aim is to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications such as detection of network intrusions, community detection, detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel efficient algorithm to solve the relaxed problem, establish convergence guarantees and demonstrate its feasibility and performance with experiments on real and very large simulated networks.http://proceedings.mlr.press/v70/aksoylar17a.htmlhttp://proceedings.mlr.press/v70/aksoylar17a/aksoylar17a.pdfhttp://proceedings.mlr.press/v70/aksoylar17a/aksoylar17a.pdfPublished versio

    Sparse Signal Processing with Linear and Nonlinear Observations: A Unified Shannon-Theoretic Approach

    Full text link
    We derive fundamental sample complexity bounds for recovering sparse and structured signals for linear and nonlinear observation models including sparse regression, group testing, multivariate regression and problems with missing features. In general, sparse signal processing problems can be characterized in terms of the following Markovian property. We are given a set of NN variables X1,X2,…,XNX_1,X_2,\ldots,X_N, and there is an unknown subset of variables S⊂{1,…,N}S \subset \{1,\ldots,N\} that are relevant for predicting outcomes YY. More specifically, when YY is conditioned on {Xn}n∈S\{X_n\}_{n\in S} it is conditionally independent of the other variables, {Xn}n∉S\{X_n\}_{n \not \in S}. Our goal is to identify the set SS from samples of the variables XX and the associated outcomes YY. We characterize this problem as a version of the noisy channel coding problem. Using asymptotic information theoretic analyses, we establish mutual information formulas that provide sufficient and necessary conditions on the number of samples required to successfully recover the salient variables. These mutual information expressions unify conditions for both linear and nonlinear observations. We then compute sample complexity bounds for the aforementioned models, based on the mutual information expressions in order to demonstrate the applicability and flexibility of our results in general sparse signal processing models.Comment: Final version submitted to Trans. I

    External Language Model Integration for Factorized Neural Transducers

    Full text link
    We propose an adaptation method for factorized neural transducers (FNT) with external language models. We demonstrate that both neural and n-gram external LMs add significantly more value when linearly interpolated with predictor output compared to shallow fusion, thus confirming that FNT forces the predictor to act like regular language models. Further, we propose a method to integrate class-based n-gram language models into FNT framework resulting in accuracy gains similar to a hybrid setup. We show average gains of 18% WERR with lexical adaptation across various scenarios and additive gains of up to 60% WERR in one entity-rich scenario through a combination of class-based n-gram and neural LMs

    Bir Türkçe konuşma tanıma sisteminin anatomisi (The anatomy of a Turkish speech recognition system)

    Get PDF
    In this paper, we present a Turkish speech recognition system we have developed. We will be giving information about the building of our system and tests we conducted on it. SUVoice voice database, along with METU 1.0, was used in the training of our acoustic models. We performed limited vocabulary and large vocabulary recognition tests. We have shown the performance of modernhidden Markov model based systems for Turkish speech recognition. For simple tasks, we can obtain 1% word error rate, whereas for large vocabulary tests the error rate is higher. This work will constitute a basis for more advanced future works in Turkish speech recognition and contribute to the accumulated knowledge in the field

    Sparse Signal Processing With Linear And Non-Linear Observations: A Unified Shannon Theoretic Approach

    No full text
    In this work we derive fundamental limits for many linear and non-linear sparse signal processing models including group testing, quantized compressive sensing, multivariate regression and observations with missing features. In general, sparse signal processing problems can be characterized in terms of the following Markovian property. We are given a set of N variables X1, Χ2,..., Xn, and there is an unknown subset of variables S ⊂ {1, 2,..., N} that are relevant for predicting outcomes/outputs Y. In other words, when Y is conditioned on {X n}nS it is conditionally independent of the other variables, {Xn}nS. Our goal is to identify the set S from samples of the variables X and the associated outcomes Y. We characterize this problem as a version of the noisy channel coding problem. Using asymptotic information theoretic analyses, we establish mutual information formulas that provide sufficient and necessary conditions on the number of samples required to successfully recover the salient variables. These mutual information expressions unify conditions for both linear and non-linear observations. We then compute sample complexity bounds for the aforementioned models, based on the mutual information expressions. © 2013 IEEE
    corecore